set more off

* Import branch-level IPEDS data and aggregate it to the system level so we can merge with the system-level Title IV data.

	* Information about identifiers: 
	* UNITID is the IPEDS branch-level identifier. Observations in IPEDS are uniquently identified by the combination of UNITID-awardyear.
	* For colleges that partipicate in federal student aid programs (virtually all of them), the DOE assigns a different identifier OPEID.
	* The first six digits of OPEID identify the plan or agreement under which the college participates, the last two digits represent the specific participant college within that plan.
	* The convention is that the ``parent'' participant within each plan (typically the biggest one) is denoted 00, and other participants are numbered 01, 02, ...
	* The OPEID concept of a plan participant appears to be very close to the UNITID concept of a branch, but not quite identical,
	* because there are a (very small) number of UNITIDs that share the same 8-digit OPEID identifier (less than 1% in each year).
	* The more important thing for us is, the Title IV data on loan volumes are only reported at the level of the plan, i.e. the six-digit OPEID, so we can only merge at that level.
	* Furthermore, even in IPEDS, some colleges seem to report financial numbers aggregated at the plan level only
	* (i.e. they report combined figures for the parent participant that has code 00, and missing values for each other participant).
	* So, as seems to be standard in the literature, we will aggregate the IPEDS data up to six-digit OPEID for all of our analysis.
	* To do this, we average sticker tuition across each participant college weighting by enrollment, 
	* sum all enrollment and financial numbers across participants,
	* use the state of location of the largest branch,
	* and treat the entire plan as for-profit if any individual participant reports itself as such.
	* In our final regression sample, about 90\% of observations are from plans with only one participant, and our general conclusions are unchanged if we use only these observations.
	* See further discussion at Robert Kelchen's website, https://robertkelchen.com/2017/08/21/beware-opeids-and-super-opeids/

	use ./Data/IPEDS/IPEDS, clear
	isid unitid awardyear

	drop if opeid == ""	
	g opeid6 = substr(opeid,1,6)
	g opeid8 = substr(opeid,7,2)

	* Keep only private colleges, using the IPEDS "institutional control/affiliation" code cntlaffi.
		keep if inlist(cntlaffi,2,3,4)

	* Label for-profit colleges
		g forprofit = cntlaffi == 2

	* Aggregate information across multiple branches with the same OPEID6 to be able to merge with TitleIV data:
	 	* Average tuition across branches, weighted by branch-specific full-year enrollment:
	 	bysort opeid6 awardyear: egen avg_chg1ay3 = wtmean(chg1ay3), weight(efytotlt)
	 	drop chg1ay3
	 	rename avg_chg1ay3 chg1ay3
		* Outcomes to sum across branches: These are all total dollar amounts, numbers of students, or amounts of instructional activity
		local balance_sheet_vars net_assets total_equity 
		local revenue_vars tuition_fees auxent_revenue fedl_apgtcx statelocal_apgtcx pvgifts_grants_contracts educational_sales hospital_revenue other_revenue affiliate_contributions investment_return independent_ops_revenue 
		local expense_vars instrl_expense rsrch_svc_expense support_expense auxent_expense netgrantaid_expense hospsvc_expense plant_expense other_expense indops_expense 
		local sumvars total_revenues total_expenses `revenue_vars' `expense_vars' `balance_sheet_vars' efytotlt grn4n2 grn4n12 grn4n22 cdactua
		foreach var of varlist `sumvars' {
			bysort opeid6 awardyear: egen sum_`var' = sum(`var')
			* Delete zero values that are substituted by Stata's egen sum function when all values are missing:
				bysort opeid6 awardyear: egen max_`var' = max(`var')
				replace sum_`var' = . if max_`var' == .
			drop `var' max_`var'
			rename sum_`var' `var'
		}

	* Now that we have aggregated by opeid6-awardyear, keep only the primary participant college in each plan 
		* (which is thus implicitly used to classify for-profit status and state of location)
		bysort opeid6 awardyear (opeid8): g num_branches = _N // this is just for keeping track of how many branches we dropped
		bysort opeid6 awardyear (opeid8): keep if _n == 1
		keep opeid6 awardyear stabbr forprofit chg1ay3 num_branches `sumvars'

 	* Statecode is a program I wrote to assign numeric codes 1-50 to states, making them easier to work with
	 	statecode stabbr postal 
	 	drop stabbr
	 	rename statecode state

	* Construct fraction of aid recipients from low-income backgrounds:
		* The denominator is first-time, full-time, degree/certificate seeking undergraduates who received Title IV federal student aid (labeled "Group 4" on the SFA form).
		* This is the only group for which the SFA form breaks down HH income levels.
		* The numerator is the number of students in "Group 4" whose households have, respectively, $0-30k and $30k-48k of annual income.
		g frac_low_income = (grn4n12+grn4n22)/grn4n2

	tempfile IPEDS
	save `IPEDS'

* Import annual TitleIV data and merge onto IPEDS

	use ./Data/TitleIV/TitleIV, clear
	drop if substr(opeid,7,2) ~= "00" // just one observation that ultimately doesn't enter our sample
	g opeid6 = substr(opeid,1,6)
	isid opeid6 awardyear

	egen ugrad_volloans = rowtotal(DL_ugrad_*_volloans)
		* This one is only reliable after the switch to DL in 2010,
		* because the FL program always reported Stafford loans on a combined grad+undergrad basis
		replace ugrad_volloans = . if awardyear < 2010
	egen ugrad_plus_volloans = rowtotal(FL_ugrad_plus_volloans DL_ugrad_plus_volloans)
	egen ugrad_plus_numreps = rowtotal(FL_ugrad_plus_numreps DL_ugrad_plus_numreps)

	keep opeid6 awardyear ugrad_volloans ugrad_plus_volloans ugrad_plus_numreps 

	merge 1:1 opeid6 awardyear using `IPEDS', nogen keep(3)

	destring opeid6, replace
	xtset opeid6 awardyear

	tempfile IPEDS_TitleIV
	save `IPEDS_TitleIV'

* Import Pell grant data and merge onto the above dataset
	insheet using ./Data/Pell_Grants/Pell_Grant.csv, clear
	rename ope_id opeid
	replace opeid = subinstr(opeid,"`=char(9)'","",.)
	drop if inlist(opeid,"00000000","000000NA")
	rename award_year awardyear
	isid opeid awardyear
	g opeid6 = substr(opeid,1,6)
	destring opeid6, replace
	foreach var of varlist pell_recipients pell_disbursements {
		replace `var' = "" if `var' == "NA"
		destring `var', replace
		rename `var' `var'_branch
		bysort opeid6 awardyear: egen `var' = sum(`var'_branch)
	}
	by opeid6 awardyear: keep if _n == 1
	keep opeid6 awardyear pell_recipients pell_disbursements
	tempfile Pell
	save `Pell'

	use `IPEDS_TitleIV', clear
	merge 1:1 opeid6 awardyear using `Pell', nogen keep(1 3)

save ./Data/IPEDS_TitleIV_merged, replace
